Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

نویسندگان

  • Mahmoud El-Haj
  • Udo Kruschwitz
  • Chris Fox
چکیده

Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advancing the research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented. The current paper describes and extends the resource creation activities and evaluations that underpinned experiments and findings that have previously appeared as an LREC workshop paper (El-Haj et al 2010), a student conference paper (El-Haj et al 2011b), and a description of a multilingual summarisation pilot (El-Haj et al 2011c; Giannakopoulos et al 2011). M. El-Haj School of Computing and Communications, Lancaster University, UK Tel.: +44(0)1524 51 0348 E-mail: [email protected] U. Kruschwitz CSEE, University of Essex, UK Tel.: +44 (0)1206 87 2669 E-mail: [email protected] C. Fox CSEE, University of Essex, UK Tel.: +44 (0)1206 87 2576 E-mail: [email protected] 2 El-Haj, Kruschwitz and Fox

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages

Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collect...

متن کامل

Considering a resource-light approach to learning verb valencies

Here we describe work on learning the subcategories of verbs in a morphologically rich language using only minimal linguistic resources. Our goal is to learn verb subcategorizations for Quechua, an under-resourced morphologically rich language, from an unannotated corpus. We compare results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same te...

متن کامل

Tagging Ingush - Language Technology For Low-Resource Languages Using Resources From Linguistic Field Work

This paper presents on-going work on creating NLP tools for under-resourced languages from very sparse training data coming from linguistic field work. In this work, we focus on Ingush, a Nakh-Daghestanian language spoken by about 300,000 people in the Russian republics Ingushetia and Chechnya. We present work on morphosyntactic taggers trained on transcribed and linguistically analyzed recordi...

متن کامل

Cross-language F0 modeling for under-resourced tonal languages: a case study on Thai-Mandarin

This paper proposed a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large training data which are deficient in many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of them are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-re...

متن کامل

Employing Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: the Case of Under-resourced Persian-spanish Language Pair

The quality of Neural Machine Translation (NMT) systems like Statistical Machine Translation (SMT) systems, heavily depends on the size of training data set, while for some pairs of languages, high-quality parallel data are poor resources. In order to respond to this low-resourced training data bottleneck reality, we employ the pivoting approach in both neural MT and statistical MT frameworks. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Language Resources and Evaluation

دوره 49  شماره 

صفحات  -

تاریخ انتشار 2015